Exploratory Data Analysis for Machine Learning

Peer Review Project 1

This project uses the publicly available Breast Cancer Wisconsin (Diagnostic) Data Set hosted on Kaggle:

https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34.


I intend to apply the steps taught in the last two weeks, i.e.:

  1. Data retrieval
  2. Data cleaning (removing/imputing missing values and outliers if present)
  3. Checking features using plots/visualisations
  4. Feature engineering such as transformations for linear regression modelling
  5. Hypothesis testing for features that are associated with Benign vs Malignant tissue

Module Imports

Importing the data
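A minimal sketch of the loading step, assuming the Kaggle download is a single CSV read with pandas. The helper name `load_data` and the tiny in-memory sample (made-up values) are mine, for illustration only. Some CSV exports carry an all-empty trailing column, so the helper drops any column that is entirely empty; this is harmless when none exist.

```python
import io

import pandas as pd


def load_data(source):
    """Read the CSV and drop any column that contains no values at all."""
    df = pd.read_csv(source)
    return df.dropna(axis="columns", how="all")


# Tiny in-memory stand-in for the real file (made-up values, illustration only).
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean,perimeter_mean,Unnamed: 32\n"
    "1,M,18.0,120.0,\n"
    "2,B,12.5,80.0,\n"
)
sample = load_data(sample_csv)
print(sample.shape)
print(sample.columns.tolist())
```

With the real download, the call would be something like `load_data("data.csv")`, where the file name is an assumption about how the CSV was saved locally.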

For purposes of prediction, unique identifiers such as the ID are not useful and can be dropped.
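Dropping the identifier could look roughly like this. The column name `id` and the helper `drop_identifier` are assumptions, and the toy frame uses made-up values.

```python
import pandas as pd


def drop_identifier(df, id_column="id"):
    """Return a copy without the identifier column (no error if it is absent)."""
    return df.drop(columns=[id_column], errors="ignore")


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame(
    {"id": [101, 102], "diagnosis": ["M", "B"], "radius_mean": [18.0, 12.5]}
)
features = drop_identifier(toy)
print(features.columns.tolist())
```

Returning a copy rather than dropping in place keeps the raw frame available in case the ID is needed later to trace a sample.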

Since there aren't any missing values or categorical variables other than the diagnosis in the dataset, we can look at some basic descriptive statistics.
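One way to check these claims and produce the summary, sketched with pandas. The helper name `summarize` is hypothetical and the toy frame holds made-up values; on the real data the same calls would report missing values per column, list the non-numeric columns, and give the usual count/mean/std/quartile table.

```python
import pandas as pd


def summarize(df):
    """Return missing-value counts, non-numeric column names, and describe()."""
    missing = df.isna().sum()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    stats = df.describe()  # numeric columns only by default
    return missing, categorical, stats


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame(
    {"diagnosis": ["M", "B", "B"], "radius_mean": [18.0, 12.5, 11.0]}
)
missing, categorical, stats = summarize(toy)
print(missing)
print(categorical)
print(stats)
```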

Whew, that's a lot of data. Maybe a pair plot with diagnosis as the hue can tell us more. Since there are so many variables, let's use only 6 of them for now.
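A sketch of the pair plot, assuming seaborn is used. The six features listed here are a hypothetical choice, since the ones actually plotted are not named in the text, and both helper names are mine.

```python
import pandas as pd

# Hypothetical choice of six features; the original selection is not shown.
PAIRPLOT_FEATURES = [
    "radius_mean", "texture_mean", "perimeter_mean",
    "area_mean", "smoothness_mean", "compactness_mean",
]


def pairplot_subset(df, features=PAIRPLOT_FEATURES, hue="diagnosis"):
    """Keep the hue column plus whichever requested features are present."""
    present = [col for col in features if col in df.columns]
    return df[[hue] + present]


def draw_pairplot(df, features=PAIRPLOT_FEATURES, hue="diagnosis"):
    """Draw the pair plot, colouring points by diagnosis."""
    # Imported here so the subset helper works without plotting libraries.
    import seaborn as sns
    return sns.pairplot(pairplot_subset(df, features, hue), hue=hue)


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "radius_mean": [18.0, 12.5],
    "texture_mean": [20.0, 17.0],
    "fractal_dimension_mean": [0.06, 0.07],
})
subset = pairplot_subset(toy)
print(subset.columns.tolist())
```

Calling `draw_pairplot(df)` on the full frame would then render the grid; restricting the columns first keeps the figure readable.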

There seems to be a linear relationship between the dimensions of the nuclei and the diagnosis of the cancer. That would be an interesting hypothesis to test.

But first we need to transform these features towards a normal distribution for linear regression modelling.

All features except texture are skewed, so we shall apply a log transformation to them before hypothesis testing.
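The skew check and transformation might be sketched as follows. The 0.75 cut-off for calling a feature skewed is an assumed rule of thumb, the helper name is hypothetical, and the toy values are made up. The natural log is only defined for positive values, so the helper refuses columns containing zeros or negatives; `np.log1p` would be one alternative for those.

```python
import numpy as np
import pandas as pd


def log_transform_skewed(df, threshold=0.75):
    """Log-transform numeric columns whose absolute skewness exceeds threshold.

    The threshold is an assumed rule of thumb, not a value from the notebook.
    """
    skew = df.select_dtypes(include="number").skew()
    skewed_cols = skew[skew.abs() > threshold].index.tolist()
    out = df.copy()
    for col in skewed_cols:
        if (out[col] <= 0).any():
            raise ValueError(f"{col} has non-positive values; consider np.log1p")
        out[col] = np.log(out[col])
    return out, skewed_cols


# Toy frame with made-up values: one right-skewed and one symmetric column.
toy = pd.DataFrame({
    "area_mean": [200.0, 300.0, 400.0, 500.0, 2500.0],
    "texture_mean": [18.0, 19.0, 20.0, 21.0, 22.0],
})
transformed, skewed_cols = log_transform_skewed(toy)
print(skewed_cols)
print(transformed)
```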

Hypothesis testing

H0 - Nuclei with a perimeter mean greater than 4.5 are not malignant

H1 - Nuclei with a perimeter mean greater than 4.5 are malignant

Such a low p-value leads us to reject the null hypothesis. Tissue with a mean perimeter above 4.5 (on the log-transformed scale) is likely to be malignant.
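The test used above is not shown, so the following is only one possible way to examine the hypothesis: a chi-square test of independence between "log perimeter above 4.5" and the diagnosis, using `scipy.stats.chi2_contingency`. The helper name, the column name `log_perimeter_mean`, and the toy values are all assumptions. The toy data is constructed so that the association is visible and says nothing about the real dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def threshold_association(df, feature, threshold, label="diagnosis"):
    """Cross-tabulate 'feature above threshold' against the label and test it."""
    group = (df[feature] > threshold).map({True: "above", False: "not above"})
    table = pd.crosstab(group, df[label])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return table, p_value


# Toy data with made-up values: 10 malignant (M) and 10 benign (B) samples.
toy = pd.DataFrame({
    "log_perimeter_mean": [4.6, 4.7, 4.8, 4.9, 4.6, 4.7, 4.8, 4.9, 4.3, 4.4,
                           4.1, 4.2, 4.3, 4.4, 4.1, 4.2, 4.3, 4.4, 4.6, 4.7],
    "diagnosis": ["M"] * 10 + ["B"] * 10,
})
table, p_value = threshold_association(toy, "log_perimeter_mean", 4.5)
print(table)
print(p_value)
```

Note that a chi-square test only indicates whether the two groupings are associated; the direction (above the threshold going with malignant) has to be read from the contingency table itself.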

Next steps:

  1. Check how other features correlate with the cancer being benign or malignant.
  2. This data set contained only 569 samples. More data would be needed to confirm or refute these findings.
  3. A model can be trained to look for these key features in identifying benign or malignant cancer tissue.

Thank you!